This project serves as the final milestone of the Google Data Analytics Professional Certificate. It is a case study on Bellabeat, a wellness technology company that manufactures health-focused smart products for women. Bellabeat offers a range of smart devices that collect health and lifestyle data to empower women with knowledge about their own health and habits. The devices work hand in hand with the Bellabeat app to provide users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits, so that users can better understand their current habits and make healthier decisions.
The objective of this study is to analyze consumers' usage data from non-Bellabeat smart devices and determine how it could unlock new growth opportunities for Bellabeat. The insights drawn will be used to develop high-level recommendations for Bellabeat's marketing strategy.
In this project, the exploratory data analysis (EDA) approach will be used to investigate trends, patterns, and relationships in the dataset and derive insights from it. The work follows the Ask, Prepare, Process, Analyze, Share, and Act process, using the Python programming language.
The aim of this project is to understand how consumers use non-Bellabeat smart devices and to develop high-level recommendations for Bellabeat's marketing strategy, guided by the following questions:
Stakeholders
The Fitbit Fitness Tracker Data from the Kaggle web repository will be used for this analysis.
The dataset is confirmed to be open-source and licensed under the CC0: Public Domain. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. The dataset can be copied, modified, distributed, and used for analysis, even for commercial purposes, all without asking permission.
The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk over 31 days, between 04.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring (note that 33 distinct Ids actually appear in the activity data). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.
The dataset consists of 18 CSV files in total, each containing various health and activity metrics tracked by Fitbit. Using an elimination approach, 13 dataframes were removed from the analysis because they were either duplicates of a larger dataframe, had too small a sample size, or contained data that was not meaningful for the analysis.
Hence, here are the 5 dataframes that will be used for this analysis:
| Table Name | Type | Description |
|---|---|---|
| 1. dailyActivity_merged | Microsoft Excel CSV | Daily Activity over 31 days of 33 IDs. Tracking daily: Steps, Distance, Intensities, Calories |
| 2. hourlyCalories_merged | Microsoft Excel CSV | Hourly Calories burned over 31 days of 33 IDs |
| 3. hourlyIntensities_merged | Microsoft Excel CSV | Hourly total and average intensity over 31 days of 33 IDs |
| 4. hourlySteps_merged | Microsoft Excel CSV | Hourly Steps over 31 days of 33 IDs |
| 5. sleepDay_merged | Microsoft Excel CSV | Daily sleep logs of 24 IDs, tracking: total sleep records per day, total minutes asleep, total time in bed |
The dataset has the limitation of a small sample size (30 users) that may not represent the entire population and may render conclusions drawn from the analysis invalid. Furthermore, demographic information such as age, gender, and ethnicity, which is crucial for determining Bellabeat's target-market strategy, was not provided in the dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import datetime as dt
# Loading the data into the pandas data frame.
daily_activity = pd.read_csv('dailyActivity_merged.csv')
hourly_calories = pd.read_csv('hourlyCalories_merged.csv')
hourly_intensities = pd.read_csv('hourlyIntensities_merged.csv')
hourly_steps = pd.read_csv('hourlySteps_merged.csv')
sleep_day = pd.read_csv('sleepDay_merged.csv')
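Before relying on the sample-size figures quoted above, it is worth confirming the distinct user counts directly with `Series.nunique`. A minimal sketch using toy stand-in frames (run the same calls on the loaded `daily_activity` and `sleep_day` frames; the real CSVs hold 33 and 24 unique Ids respectively):

```python
import pandas as pd

# Toy stand-ins for the loaded dataframes; the real CSVs hold 33 and 24 unique Ids
toy_daily = pd.DataFrame({'Id': [1, 1, 2, 3], 'TotalSteps': [100, 200, 300, 400]})
toy_sleep = pd.DataFrame({'Id': [1, 2], 'TotalMinutesAsleep': [400, 350]})

# Count distinct users in each dataframe
print(toy_daily['Id'].nunique())  # 3 here; 33 on the real data
print(toy_sleep['Id'].nunique())  # 2 here; 24 on the real data
```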
# Displaying the top 5 rows of each dataset
print('\033[1m' + 'daily_activity')
display(daily_activity.head())
print('\033[1m' + 'hourly_calories')
display(hourly_calories.head())
print('\033[1m' + 'hourly_intensities')
display(hourly_intensities.head())
print('\033[1m' + 'hourly_steps')
display(hourly_steps.head())
print('\033[1m' + 'sleep_day')
display(sleep_day.head())
daily_activity
| Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 |
| 1 | 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 |
| 2 | 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0.0 | 2.44 | 0.40 | 3.91 | 0.0 | 30 | 11 | 181 | 1218 | 1776 |
| 3 | 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 |
| 4 | 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 |
hourly_calories
| Id | ActivityHour | Calories | |
|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 12:00:00 AM | 81 |
| 1 | 1503960366 | 4/12/2016 1:00:00 AM | 61 |
| 2 | 1503960366 | 4/12/2016 2:00:00 AM | 59 |
| 3 | 1503960366 | 4/12/2016 3:00:00 AM | 47 |
| 4 | 1503960366 | 4/12/2016 4:00:00 AM | 48 |
hourly_intensities
| Id | ActivityHour | TotalIntensity | AverageIntensity | |
|---|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 12:00:00 AM | 20 | 0.333333 |
| 1 | 1503960366 | 4/12/2016 1:00:00 AM | 8 | 0.133333 |
| 2 | 1503960366 | 4/12/2016 2:00:00 AM | 7 | 0.116667 |
| 3 | 1503960366 | 4/12/2016 3:00:00 AM | 0 | 0.000000 |
| 4 | 1503960366 | 4/12/2016 4:00:00 AM | 0 | 0.000000 |
hourly_steps
| Id | ActivityHour | StepTotal | |
|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 12:00:00 AM | 373 |
| 1 | 1503960366 | 4/12/2016 1:00:00 AM | 160 |
| 2 | 1503960366 | 4/12/2016 2:00:00 AM | 151 |
| 3 | 1503960366 | 4/12/2016 3:00:00 AM | 0 |
| 4 | 1503960366 | 4/12/2016 4:00:00 AM | 0 |
sleep_day
| Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed | |
|---|---|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 12:00:00 AM | 1 | 327 | 346 |
| 1 | 1503960366 | 4/13/2016 12:00:00 AM | 2 | 384 | 407 |
| 2 | 1503960366 | 4/15/2016 12:00:00 AM | 1 | 412 | 442 |
| 3 | 1503960366 | 4/16/2016 12:00:00 AM | 2 | 340 | 367 |
| 4 | 1503960366 | 4/17/2016 12:00:00 AM | 1 | 700 | 712 |
Now we will get an overview (number of entries, null values, column names) of the dataframes and check for any incorrect data types.
print('\033[1m' + 'daily_activity' + '\033[0m')
daily_activity.info()
daily_activity
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 940 non-null int64
1 ActivityDate 940 non-null object
2 TotalSteps 940 non-null int64
3 TotalDistance 940 non-null float64
4 TrackerDistance 940 non-null float64
5 LoggedActivitiesDistance 940 non-null float64
6 VeryActiveDistance 940 non-null float64
7 ModeratelyActiveDistance 940 non-null float64
8 LightActiveDistance 940 non-null float64
9 SedentaryActiveDistance 940 non-null float64
10 VeryActiveMinutes 940 non-null int64
11 FairlyActiveMinutes 940 non-null int64
12 LightlyActiveMinutes 940 non-null int64
13 SedentaryMinutes 940 non-null int64
14 Calories 940 non-null int64
dtypes: float64(7), int64(7), object(1)
memory usage: 110.3+ KB
print('\033[1m' + 'hourly_calories' + '\033[0m')
hourly_calories.info()
hourly_calories
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 22099 non-null int64
1 ActivityHour 22099 non-null object
2 Calories 22099 non-null int64
dtypes: int64(2), object(1)
memory usage: 518.1+ KB
print('\033[1m' + 'hourly_intensities' + '\033[0m')
hourly_intensities.info()
hourly_intensities
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 22099 non-null int64
1 ActivityHour 22099 non-null object
2 TotalIntensity 22099 non-null int64
3 AverageIntensity 22099 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 690.7+ KB
print('\033[1m' + 'hourly_steps' + '\033[0m')
hourly_steps.info()
hourly_steps
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 22099 non-null int64
1 ActivityHour 22099 non-null object
2 StepTotal 22099 non-null int64
dtypes: int64(2), object(1)
memory usage: 518.1+ KB
print('\033[1m' + 'sleep_day' + '\033[0m')
sleep_day.info()
sleep_day
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 413 non-null int64
1 SleepDay 413 non-null object
2 TotalSleepRecords 413 non-null int64
3 TotalMinutesAsleep 413 non-null int64
4 TotalTimeInBed 413 non-null int64
dtypes: int64(4), object(1)
memory usage: 16.3+ KB
Notice that the data types of the ActivityDate, ActivityHour, and SleepDay columns are in the object format. We will convert them to the date-time format later on (Section 4.4.3).
The process involves checking for duplicate rows, removing them where found, and checking for null values:
# Identifying the number of duplicate rows in each dataframe
# (print returns None, so there is no need to assign its result to a variable)
print("daily_activity =", daily_activity.duplicated().sum())
print("hourly_calories =", hourly_calories.duplicated().sum())
print("hourly_intensities =", hourly_intensities.duplicated().sum())
print("hourly_steps =", hourly_steps.duplicated().sum())
print("sleep_day =", sleep_day.duplicated().sum())
daily_activity = 0
hourly_calories = 0
hourly_intensities = 0
hourly_steps = 0
sleep_day = 3
Found 3 duplicates in the sleep_day dataframe.
# Extracting the duplicated rows in sleep_day dataframe
sleep_day.loc[sleep_day.duplicated(), :]
| Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed | |
|---|---|---|---|---|---|
| 161 | 4388161847 | 5/5/2016 12:00:00 AM | 1 | 471 | 495 |
| 223 | 4702921684 | 5/7/2016 12:00:00 AM | 1 | 520 | 543 |
| 380 | 8378563200 | 4/25/2016 12:00:00 AM | 1 | 388 | 402 |
# Dropping the duplicates; drop_duplicates returns a new frame, so reassign it
sleep_day = sleep_day.drop_duplicates()
sleep_day
| Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed | |
|---|---|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 12:00:00 AM | 1 | 327 | 346 |
| 1 | 1503960366 | 4/13/2016 12:00:00 AM | 2 | 384 | 407 |
| 2 | 1503960366 | 4/15/2016 12:00:00 AM | 1 | 412 | 442 |
| 3 | 1503960366 | 4/16/2016 12:00:00 AM | 2 | 340 | 367 |
| 4 | 1503960366 | 4/17/2016 12:00:00 AM | 1 | 700 | 712 |
| ... | ... | ... | ... | ... | ... |
| 408 | 8792009665 | 4/30/2016 12:00:00 AM | 1 | 343 | 360 |
| 409 | 8792009665 | 5/1/2016 12:00:00 AM | 1 | 503 | 527 |
| 410 | 8792009665 | 5/2/2016 12:00:00 AM | 1 | 415 | 423 |
| 411 | 8792009665 | 5/3/2016 12:00:00 AM | 1 | 516 | 545 |
| 412 | 8792009665 | 5/4/2016 12:00:00 AM | 1 | 439 | 463 |
410 rows × 5 columns
Note: the sleep_day dataframe started with 413 entries and is now at 410 after removing the 3 duplicates.
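A common pitfall here: `drop_duplicates` returns a new frame rather than modifying the original in place, so the result must be reassigned (or `inplace=True` passed). A quick toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({'Id': [1, 1, 2], 'TotalMinutesAsleep': [327, 327, 384]})

toy.drop_duplicates()        # returns a deduplicated copy; toy itself is unchanged
assert len(toy) == 3

toy = toy.drop_duplicates()  # reassigning keeps the deduplicated frame
assert len(toy) == 2
```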
No null values were found in any of the dataframes, so null removal is not needed.
# Total number of null values
print("daily_activity =", daily_activity.isnull().sum().sum())
print("hourly_calories =", hourly_calories.isnull().sum().sum())
print("hourly_intensities =", hourly_intensities.isnull().sum().sum())
print("hourly_steps =", hourly_steps.isnull().sum().sum())
print("sleep_day =", sleep_day.isnull().sum().sum())
daily_activity = 0
hourly_calories = 0
hourly_intensities = 0
hourly_steps = 0
sleep_day = 0
As identified in Section 4.3, the timestamp columns of the respective dataframes are stored in the 'object' format. We will convert them to the date-time format, displaying dates as "yyyy-mm-dd". The SleepDay column of the sleep_day dataframe will be split into Date and Time columns so it can be merged with the daily_activity dataframe later.
# Convert to date-time format; explicit format strings avoid parsing ambiguity
daily_activity['ActivityDate'] = pd.to_datetime(daily_activity['ActivityDate'], format='%m/%d/%Y')
hourly_calories['ActivityHour'] = pd.to_datetime(hourly_calories['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p')
hourly_intensities['ActivityHour'] = pd.to_datetime(hourly_intensities['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p')
hourly_steps['ActivityHour'] = pd.to_datetime(hourly_steps['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p')
# Split SleepDay into Date and Time columns
sleep_day['Date'] = pd.to_datetime(sleep_day['SleepDay'], format='%m/%d/%Y %I:%M:%S %p')
sleep_day['Time'] = sleep_day['Date'].dt.time
# Rearranging columns
sleep_day = sleep_day[['Id','Date','Time','TotalSleepRecords','TotalMinutesAsleep','TotalTimeInBed']]
# Rename ActivityDate column
daily_activity = daily_activity.rename(columns={'ActivityDate': 'Date'})
# Adding a DayOfWeek column (Date is already datetime, so the .dt accessor works directly)
daily_activity['DayOfWeek'] = daily_activity['Date'].dt.day_name()
# Rearranging: move DayOfWeek to column position 2, right after Date
daily_activity.insert(loc=2, column='DayOfWeek', value=daily_activity.pop('DayOfWeek'))
print('\033[1m' + 'daily_activity' + '\033[0m')
display(daily_activity.head())
print('\033[1m' + 'hourly_calories' + '\033[0m')
display(hourly_calories.head())
print('\033[1m' + 'hourly_intensities' + '\033[0m')
display(hourly_intensities.head())
print('\033[1m' + 'hourly_steps' + '\033[0m')
display(hourly_steps.head())
print('\033[1m' + 'sleep_day' + '\033[0m')
display(sleep_day.head())
daily_activity
| Id | Date | DayOfWeek | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | Tuesday | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 |
| 1 | 1503960366 | 2016-04-13 | Wednesday | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 |
| 2 | 1503960366 | 2016-04-14 | Thursday | 10460 | 6.74 | 6.74 | 0.0 | 2.44 | 0.40 | 3.91 | 0.0 | 30 | 11 | 181 | 1218 | 1776 |
| 3 | 1503960366 | 2016-04-15 | Friday | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 |
| 4 | 1503960366 | 2016-04-16 | Saturday | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 |
hourly_calories
| Id | ActivityHour | Calories | |
|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 00:00:00 | 81 |
| 1 | 1503960366 | 2016-04-12 01:00:00 | 61 |
| 2 | 1503960366 | 2016-04-12 02:00:00 | 59 |
| 3 | 1503960366 | 2016-04-12 03:00:00 | 47 |
| 4 | 1503960366 | 2016-04-12 04:00:00 | 48 |
hourly_intensities
| Id | ActivityHour | TotalIntensity | AverageIntensity | |
|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 00:00:00 | 20 | 0.333333 |
| 1 | 1503960366 | 2016-04-12 01:00:00 | 8 | 0.133333 |
| 2 | 1503960366 | 2016-04-12 02:00:00 | 7 | 0.116667 |
| 3 | 1503960366 | 2016-04-12 03:00:00 | 0 | 0.000000 |
| 4 | 1503960366 | 2016-04-12 04:00:00 | 0 | 0.000000 |
hourly_steps
| Id | ActivityHour | StepTotal | |
|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 00:00:00 | 373 |
| 1 | 1503960366 | 2016-04-12 01:00:00 | 160 |
| 2 | 1503960366 | 2016-04-12 02:00:00 | 151 |
| 3 | 1503960366 | 2016-04-12 03:00:00 | 0 |
| 4 | 1503960366 | 2016-04-12 04:00:00 | 0 |
sleep_day
| Id | Date | Time | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed | |
|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 00:00:00 | 1 | 327 | 346 |
| 1 | 1503960366 | 2016-04-13 | 00:00:00 | 2 | 384 | 407 |
| 2 | 1503960366 | 2016-04-15 | 00:00:00 | 1 | 412 | 442 |
| 3 | 1503960366 | 2016-04-16 | 00:00:00 | 2 | 340 | 367 |
| 4 | 1503960366 | 2016-04-17 | 00:00:00 | 1 | 700 | 712 |
Merging the daily_activity and sleep_day dataframes on the Id and Date columns as the keys:
daily_activity_sleep = daily_activity.merge(sleep_day,on=['Id','Date'],how='left')
display(daily_activity_sleep)
| Id | Date | DayOfWeek | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | Time | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | Tuesday | 13162 | 8.500000 | 8.500000 | 0.0 | 1.88 | 0.55 | 6.06 | 0.00 | 25 | 13 | 328 | 728 | 1985 | 00:00:00 | 1.0 | 327.0 | 346.0 |
| 1 | 1503960366 | 2016-04-13 | Wednesday | 10735 | 6.970000 | 6.970000 | 0.0 | 1.57 | 0.69 | 4.71 | 0.00 | 21 | 19 | 217 | 776 | 1797 | 00:00:00 | 2.0 | 384.0 | 407.0 |
| 2 | 1503960366 | 2016-04-14 | Thursday | 10460 | 6.740000 | 6.740000 | 0.0 | 2.44 | 0.40 | 3.91 | 0.00 | 30 | 11 | 181 | 1218 | 1776 | NaN | NaN | NaN | NaN |
| 3 | 1503960366 | 2016-04-15 | Friday | 9762 | 6.280000 | 6.280000 | 0.0 | 2.14 | 1.26 | 2.83 | 0.00 | 29 | 34 | 209 | 726 | 1745 | 00:00:00 | 1.0 | 412.0 | 442.0 |
| 4 | 1503960366 | 2016-04-16 | Saturday | 12669 | 8.160000 | 8.160000 | 0.0 | 2.71 | 0.41 | 5.04 | 0.00 | 36 | 10 | 221 | 773 | 1863 | 00:00:00 | 2.0 | 340.0 | 367.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 938 | 8877689391 | 2016-05-08 | Sunday | 10686 | 8.110000 | 8.110000 | 0.0 | 1.08 | 0.20 | 6.80 | 0.00 | 17 | 4 | 245 | 1174 | 2847 | NaN | NaN | NaN | NaN |
| 939 | 8877689391 | 2016-05-09 | Monday | 20226 | 18.250000 | 18.250000 | 0.0 | 11.10 | 0.80 | 6.24 | 0.05 | 73 | 19 | 217 | 1131 | 3710 | NaN | NaN | NaN | NaN |
| 940 | 8877689391 | 2016-05-10 | Tuesday | 10733 | 8.150000 | 8.150000 | 0.0 | 1.35 | 0.46 | 6.28 | 0.00 | 18 | 11 | 224 | 1187 | 2832 | NaN | NaN | NaN | NaN |
| 941 | 8877689391 | 2016-05-11 | Wednesday | 21420 | 19.559999 | 19.559999 | 0.0 | 13.22 | 0.41 | 5.89 | 0.00 | 88 | 12 | 213 | 1127 | 3832 | NaN | NaN | NaN | NaN |
| 942 | 8877689391 | 2016-05-12 | Thursday | 8064 | 6.120000 | 6.120000 | 0.0 | 1.82 | 0.04 | 4.25 | 0.00 | 23 | 1 | 137 | 770 | 1849 | NaN | NaN | NaN | NaN |
943 rows × 20 columns
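Note that the merged frame reports 943 rows even though daily_activity has only 940: a left join keeps every left row, and any (Id, Date) key that appears more than once on the right fans out into multiple output rows, which is exactly what leftover duplicates in sleep_day would cause. A toy sketch of the fan-out:

```python
import pandas as pd

left = pd.DataFrame({'Id': [1, 1], 'Date': ['2016-04-12', '2016-04-13'],
                     'TotalSteps': [13162, 10735]})
# The key (1, '2016-04-12') appears twice on the right-hand side
right = pd.DataFrame({'Id': [1, 1], 'Date': ['2016-04-12', '2016-04-12'],
                      'TotalMinutesAsleep': [327, 327]})

merged = left.merge(right, on=['Id', 'Date'], how='left')
print(len(merged))  # 3, not 2: the duplicated key produced an extra row
```

This is another reason to make sure the duplicate sleep_day rows are actually dropped before merging.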
Merging the hourly_calories, hourly_intensities, and hourly_steps dataframes on the Id and ActivityHour columns as the keys to form a new dataframe:
#Merge hourly dataframes
hourly_metrics = hourly_calories.merge(hourly_intensities,on=['Id','ActivityHour'],how='inner')\
.merge(hourly_steps,on=['Id','ActivityHour'],how='inner')
#Rename columns
hourly_metrics = hourly_metrics.rename(columns={'ActivityHour': 'DateTime'})
hourly_metrics = hourly_metrics.rename(columns={'StepTotal': 'TotalSteps'})
display(hourly_metrics)
| Id | DateTime | Calories | TotalIntensity | AverageIntensity | TotalSteps | |
|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 00:00:00 | 81 | 20 | 0.333333 | 373 |
| 1 | 1503960366 | 2016-04-12 01:00:00 | 61 | 8 | 0.133333 | 160 |
| 2 | 1503960366 | 2016-04-12 02:00:00 | 59 | 7 | 0.116667 | 151 |
| 3 | 1503960366 | 2016-04-12 03:00:00 | 47 | 0 | 0.000000 | 0 |
| 4 | 1503960366 | 2016-04-12 04:00:00 | 48 | 0 | 0.000000 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 22094 | 8877689391 | 2016-05-12 10:00:00 | 126 | 12 | 0.200000 | 514 |
| 22095 | 8877689391 | 2016-05-12 11:00:00 | 192 | 29 | 0.483333 | 1407 |
| 22096 | 8877689391 | 2016-05-12 12:00:00 | 321 | 93 | 1.550000 | 3135 |
| 22097 | 8877689391 | 2016-05-12 13:00:00 | 101 | 6 | 0.100000 | 307 |
| 22098 | 8877689391 | 2016-05-12 14:00:00 | 113 | 9 | 0.150000 | 457 |
22099 rows × 6 columns
The describe() function provides a holistic statistical overview of the dataframes to draw insights for the analysis.
# Exclude the Id column from the summary statistics (drop preserves column order)
summary_daily_activity = daily_activity_sleep.drop(columns=['Id'])
summary_daily_activity.describe()
| ModeratelyActiveDistance | Calories | LightlyActiveMinutes | TotalMinutesAsleep | LoggedActivitiesDistance | TotalSleepRecords | SedentaryActiveDistance | VeryActiveDistance | TrackerDistance | TotalSteps | VeryActiveMinutes | LightActiveDistance | TotalDistance | TotalTimeInBed | FairlyActiveMinutes | SedentaryMinutes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 943.000000 | 943.000000 | 943.000000 | 413.000000 | 943.000000 | 413.000000 | 943.000000 | 943.000000 | 943.000000 | 943.000000 | 943.000000 | 943.000000 | 943.000000 | 413.000000 | 943.000000 | 943.000000 |
| mean | 0.570880 | 2307.507953 | 193.025451 | 419.467312 | 0.110045 | 1.118644 | 0.001601 | 1.504316 | 5.488547 | 7652.188759 | 21.239661 | 3.349258 | 5.502853 | 458.639225 | 13.628844 | 990.353128 |
| std | 0.884775 | 720.815522 | 109.308468 | 118.344679 | 0.622292 | 0.345521 | 0.007335 | 2.657626 | 3.909291 | 5086.532832 | 32.946264 | 2.046505 | 3.926509 | 127.101607 | 20.000746 | 301.262473 |
| min | 0.000000 | 0.000000 | 0.000000 | 58.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 61.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 1829.500000 | 127.000000 | 361.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.620000 | 3795.000000 | 0.000000 | 1.950000 | 2.620000 | 403.000000 | 0.000000 | 729.000000 |
| 50% | 0.240000 | 2140.000000 | 199.000000 | 433.000000 | 0.000000 | 1.000000 | 0.000000 | 0.220000 | 5.260000 | 7439.000000 | 4.000000 | 3.380000 | 5.260000 | 463.000000 | 7.000000 | 1057.000000 |
| 75% | 0.805000 | 2796.500000 | 264.000000 | 490.000000 | 0.000000 | 1.000000 | 0.000000 | 2.065000 | 7.715000 | 10734.000000 | 32.000000 | 4.790000 | 7.720000 | 526.000000 | 19.000000 | 1229.000000 |
| max | 6.480000 | 4900.000000 | 518.000000 | 796.000000 | 4.942142 | 3.000000 | 0.110000 | 21.920000 | 28.030001 | 36019.000000 | 210.000000 | 10.710000 | 28.030001 | 961.000000 | 143.000000 | 1440.000000 |
# Exclude the Id column from the summary statistics (drop preserves column order)
summary_hourly_metrics = hourly_metrics.drop(columns=['Id'])
summary_hourly_metrics.describe()
| TotalSteps | Calories | AverageIntensity | TotalIntensity | |
|---|---|---|---|---|
| count | 22099.000000 | 22099.000000 | 22099.000000 | 22099.000000 |
| mean | 320.166342 | 97.386760 | 0.200589 | 12.035341 |
| std | 690.384228 | 60.702622 | 0.352219 | 21.133110 |
| min | 0.000000 | 42.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 63.000000 | 0.000000 | 0.000000 |
| 50% | 40.000000 | 83.000000 | 0.050000 | 3.000000 |
| 75% | 357.000000 | 108.000000 | 0.266667 | 16.000000 |
| max | 10554.000000 | 948.000000 | 3.000000 | 180.000000 |
Create a distribution of the different activity levels by minutes:
fig, axes = plt.subplots(1, 3, figsize=(25, 6))
plt.style.use("seaborn-v0_8-colorblind")  # use "seaborn-colorblind" on matplotlib < 3.6
fig.suptitle("Distribution of Activity Types",
             fontsize=20, fontweight="bold", y=1.03)
min_ylim, max_ylim = plt.ylim()  # default axis limits, used to scale the mean labels below
# Plot Histogram for Lightly Active Minutes
axes[0].hist(daily_activity_sleep["LightlyActiveMinutes"],
histtype="bar", bins=10, edgecolor='black')
axes[0].set_xlabel("Lightly Active Minutes", fontsize=15)
axes[0].set_ylabel("No. of Records", fontsize=15)
axes[0].axvline(daily_activity_sleep["LightlyActiveMinutes"].mean()
, color='red', linestyle='dashed', linewidth=2)
axes[0].text(daily_activity_sleep["LightlyActiveMinutes"].mean()*1.05
, max_ylim*188, 'Mean: {:.2f}'.format(daily_activity_sleep["LightlyActiveMinutes"].mean()))
# Plot Histogram for Fairly Active Minutes
axes[1].hist(daily_activity_sleep["FairlyActiveMinutes"],
histtype="bar", color="y", bins=10, edgecolor='black')
axes[1].set_xlabel("Fairly Active Minutes", fontsize=15)
axes[1].set_ylabel("No. of Records", fontsize=15)
axes[1].axvline(daily_activity_sleep["FairlyActiveMinutes"].mean()
, color='red', linestyle='dashed', linewidth=2)
axes[1].text(daily_activity_sleep["FairlyActiveMinutes"].mean()*1.2
, max_ylim*645, 'Mean: {:.2f}'.format(daily_activity_sleep["FairlyActiveMinutes"].mean()))
# Plot Histogram for Very Active Minutes
axes[2].hist(daily_activity_sleep["VeryActiveMinutes"],
histtype="bar", color="g", bins=10, edgecolor='black')
axes[2].set_xlabel("Very Active Minutes", fontsize=15)
axes[2].set_ylabel("No. of Records", fontsize=15)
axes[2].axvline(daily_activity_sleep["VeryActiveMinutes"].mean()
, color='red', linestyle='dashed', linewidth=2)
axes[2].text(daily_activity_sleep["VeryActiveMinutes"].mean()*1.2
, max_ylim*645, 'Mean: {:.2f}'.format(daily_activity_sleep["VeryActiveMinutes"].mean()))
Text(25.487592788971366, 645.0, 'Mean: 21.24')
The histograms above show that 'Lightly Active Minutes' is close to a normal distribution, with higher occurrences around the mean. Users spend most of their time in the Lightly Active category (e.g., gardening, walking) and less time in the Fairly Active and Very Active categories (e.g., high-cardio activities such as running). These findings are reasonable given that the average user may be a non-athlete who uses the device for daily lifestyle activities and clocks occasional mid-to-high-intensity activities.
#Average of activity levels
average_active_min = daily_activity_sleep[['VeryActiveMinutes', 'FairlyActiveMinutes',
'LightlyActiveMinutes', 'SedentaryMinutes']].mean()
activity_level_min = pd.DataFrame(average_active_min)
activity_level_min.reset_index(inplace=True)
activity_level_min = activity_level_min.rename(columns = {'index':'ActivityLevel', 0:'AverageMinutes'})
activity_level_min
| ActivityLevel | AverageMinutes | |
|---|---|---|
| 0 | VeryActiveMinutes | 21.239661 |
| 1 | FairlyActiveMinutes | 13.628844 |
| 2 | LightlyActiveMinutes | 193.025451 |
| 3 | SedentaryMinutes | 990.353128 |
#Plot a piechart to show the distribution of average time spent in each activity level
fig = px.pie(activity_level_min, values='AverageMinutes', names ='ActivityLevel',
title = "Average total time spent in each activity level")
fig.update_traces(textposition='inside')
From the pie chart, users spend on average 16.5 hours of their day being sedentary, 3.2 hours being lightly active, 13.6 minutes being fairly active, and 21 minutes being very active.
Although users spend 21 minutes a day on average in intense activity, a significant portion of their day is sedentary. This presents a lifestyle concern that has to be addressed, or health conditions could surface in the long run, which defeats the purpose of owning Bellabeat's health and lifestyle devices.
daily_activity_sleep.columns
Index(['Id', 'Date', 'DayOfWeek', 'TotalSteps', 'TotalDistance',
'TrackerDistance', 'LoggedActivitiesDistance', 'VeryActiveDistance',
'ModeratelyActiveDistance', 'LightActiveDistance',
'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories', 'Time',
'TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed'],
dtype='object')
sort_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Select the Calories column before aggregating (avoids averaging non-numeric columns)
calories = daily_activity_sleep.groupby("DayOfWeek")['Calories'].mean().reindex(sort_days)
avg_calories_dow = pd.DataFrame(calories)
avg_calories_dow.reset_index(inplace=True)
display(avg_calories_dow)
| DayOfWeek | Calories | |
|---|---|---|
| 0 | Monday | 2338.099174 |
| 1 | Tuesday | 2356.013158 |
| 2 | Wednesday | 2302.620000 |
| 3 | Thursday | 2204.297297 |
| 4 | Friday | 2331.785714 |
| 5 | Saturday | 2365.592000 |
| 6 | Sunday | 2263.000000 |
# Plot a bar chart of average calories burned by day of week
sns.set_style("darkgrid")
plt.figure(figsize=(8,4))
sns.set_context("notebook")
ax = sns.barplot(data=avg_calories_dow, x="DayOfWeek", y="Calories",
                 errorbar=None, palette="RdBu")  # use ci=None on seaborn < 0.12
plt.title("Average calories burned by day of week", fontsize=15, fontweight="bold")
plt.xlabel("")
plt.ylabel("Average Calories Burned", fontsize=15)
plt.xticks(rotation=45)
# Display the value on top of each bar
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2, p.get_height(), '%d' % int(p.get_height()),
            fontsize=12, color='black', ha='center', va='bottom')
Based on the bar chart above, users burned a consistent amount of calories throughout the week, with the lowest on Thursday. However, daily calorie requirements vary with gender, age, and lifestyle, and the absence of such demographics in the sample means the data does not provide a holistic picture. Nevertheless, according to the U.S. Department of Health and Human Services, the average adult woman expends roughly 1,600 to 2,400 calories per day, and the average adult man 2,000 to 3,000 calories per day.
Furthermore, the average sedentary person burns approximately 1,800 calories a day. Thus, the daily mean of 2,307 calories (refer to Section 5.1) is plausible: the average user spends most of their time being sedentary, while a small subset of very active users could be skewing the mean upward.
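The skew claim can be illustrated numerically: a handful of very active users pulls the mean above the median, which is why comparing the two is a useful check. A toy example (the values below are illustrative, not drawn from the dataset):

```python
import pandas as pd

# Mostly sedentary-level daily burns plus one very active outlier (illustrative values)
toy_calories = pd.Series([1800, 1850, 1900, 1950, 4900])
print(toy_calories.mean())    # 2480.0 -- the outlier pulls the mean up
print(toy_calories.median())  # 1900.0 -- the median resists the skew
```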
hourly_metrics.head()
| Id | DateTime | Calories | TotalIntensity | AverageIntensity | TotalSteps | |
|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 00:00:00 | 81 | 20 | 0.333333 | 373 |
| 1 | 1503960366 | 2016-04-12 01:00:00 | 61 | 8 | 0.133333 | 160 |
| 2 | 1503960366 | 2016-04-12 02:00:00 | 59 | 7 | 0.116667 | 151 |
| 3 | 1503960366 | 2016-04-12 03:00:00 | 47 | 0 | 0.000000 | 0 |
| 4 | 1503960366 | 2016-04-12 04:00:00 | 48 | 0 | 0.000000 | 0 |
df = hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["Calories"].mean()
print(df)
DateTime
0      71.805139
1      70.165059
2      69.186495
3      67.538049
4      68.261803
5      81.708155
6      86.996778
7      94.477981
8     103.337272
9     106.142857
10    110.460710
11    109.806904
12    117.197397
13    115.309446
14    115.732899
15    106.637158
16    113.327453
17    122.752759
18    123.492274
19    121.484547
20    102.357616
21     96.056354
22     88.265487
23     77.593577
Name: Calories, dtype: float64
fig = px.line(hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["Calories"].mean(),
title="Average of total calories burned hourly", markers=True, y="Calories")
fig.update_layout(xaxis={'range':[0,24]}, xaxis_title="Time of Day(Hour)", yaxis_title="Average of total calories Burned",
hoverlabel=dict(
bgcolor="white",
font_size=14,
font_family="Rockwell"
))
fig.update_traces(hovertemplate='Time of Day(Hour): %{x} <br> Average of Total Calories Burned: %{y}')
According to the Sleep Foundation, we burn about 50 calories an hour while sleeping, which is reflected in the graph above. There is a clear trend where calories burned increase gradually from 4am to mid-day. A slight drop from 12pm to 1pm is also observed, likely due to postprandial somnolence (a.k.a. "food coma"), which usually sets in after lunch between 1pm and 3pm and leads to fewer calories burned while tired.
Calories burned begin increasing again at 4pm and peak at 6pm, suggesting users likely choose this hour to work out or commute after work or school. A significant decrease from 7pm to 11pm indicates that most users treat this period as rest time until bedtime.
steps = daily_activity_sleep.groupby("DayOfWeek")['TotalSteps'].mean().reindex(sort_days)
avg_steps_dow = pd.DataFrame(steps)
avg_steps_dow.reset_index(inplace=True)
display(avg_steps_dow)
| | DayOfWeek | TotalSteps |
|---|---|---|
| 0 | Monday | 7819.082645 |
| 1 | Tuesday | 8125.006579 |
| 2 | Wednesday | 7559.373333 |
| 3 | Thursday | 7420.682432 |
| 4 | Friday | 7448.230159 |
| 5 | Saturday | 8202.712000 |
| 6 | Sunday | 6933.231405 |
sns.set_style("darkgrid")
plt.figure(figsize=(8,6))
sns.set_context("notebook")
sns.boxplot(data=daily_activity_sleep, x="DayOfWeek", y="TotalSteps",
palette="colorblind", sym="", order=sort_days)
plt.title("Total steps by day of week", fontsize=15, fontweight="bold")
plt.xlabel("")
plt.ylabel("Total Steps", fontsize=15)
plt.xticks(rotation="45")
As observed from the boxplot, users clocked the highest number of steps on Saturdays, while the lowest average total steps fell on Sundays, which is likely a rest day for them. The median number of steps taken varies through the week but is fairly consistent, hovering in the 6000-7000 range, while the mean is 7652 steps. This indicates that the data is fairly evenly distributed from the lowest to the highest values.
Based on MedicineNet, here is the classification of activity levels by the number of steps taken in a day:
The data above suggests that the average user is classified as somewhat active despite spending a significant amount of time being sedentary. MedicineNet also notes that studies have shown improved blood sugar levels, lower blood pressure, and reduced symptoms of depression and anxiety in people who walk between 7,500 and 10,000 steps per day.
fig = px.line(hourly_metrics.groupby(hourly_metrics["DateTime"].dt.hour)["TotalSteps"].mean(),
title="Average of total steps taken hourly", markers=True, y="TotalSteps")
fig.update_layout(xaxis={'range':[0,24]}, xaxis_title="Time of Day",
yaxis_title="Average of Total Steps Taken")
#print("plotly express hovertemplate:", fig.data[0].hovertemplate)
fig.update_traces(hovertemplate='Time of Day: %{x} <br>Average of Total Steps: %{y}')
This line chart closely mirrors the pattern of the average-calories chart above (Section 5.5). Users generally start their day from 5am onwards and reduce the number of steps taken after 7pm.
px.defaults.template = "ggplot2"
px.defaults.color_continuous_scale = px.colors.sequential.Blackbody
px.defaults.width = 800
px.defaults.height = 600
fig = px.scatter(x=daily_activity_sleep["TotalSteps"], y=daily_activity_sleep["Calories"],
title="Correlation between Total Steps and Calories",
labels=dict(x="Total Steps",y="Calories"))
fig.update_layout(
xaxis={'range':[0,32000]})
From the scatter plot, we can observe a positive linear relationship between the two variables: users burned more calories as they took more steps. To verify this, we can use linregress() to find the r value (Pearson's correlation coefficient), which measures the strength of the linear relationship between the two variables.
from scipy.stats import linregress
xs = daily_activity_sleep["TotalSteps"]
ys = daily_activity_sleep["Calories"]
res = linregress(xs,ys)
print(res)
LinregressResult(slope=0.08402718288211401, intercept=1664.5160890160178, rvalue=0.5929492519076744, pvalue=1.3086580542942342e-90, stderr=0.0037199124194150276, intercept_stderr=34.17491705694571)
As seen from the results, the regression has an r value of 0.59, indicating a moderately strong positive linear relationship between the two variables.
linregress() also usefully returns the regression slope, intercept, p-value, and standard error. For our analysis, the regression slope measures the steepness of the best-fit line: the steeper the line, the greater the effect a change in the x variable has on the y variable. In this case, for every step users take, they expend an average of 0.08 additional calories. The r value of 0.59 should not be taken at face value, however, as it only captures the strength of a linear relationship and says nothing about non-linear effects.
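As an illustration of the fitted model, the slope and intercept reported by linregress() above can be used to estimate calories burned for a given step count. A minimal sketch (the helper name predicted_calories is ours, not part of scipy):

```python
# Slope and intercept taken from the LinregressResult printed above
slope = 0.08402718288211401
intercept = 1664.5160890160178

def predicted_calories(steps):
    """Estimate daily calories burned from total steps using the linear fit."""
    return slope * steps + intercept

# Rough estimate for a 10,000-step day
print(round(predicted_calories(10_000)))
```

The intercept (~1665 calories) can be read as the model's baseline burn at zero steps, broadly consistent with resting metabolism.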
# Mean of active distance level
activity_level_avg_dist = daily_activity_sleep[['SedentaryActiveDistance','LightActiveDistance',
'ModeratelyActiveDistance','VeryActiveDistance']].mean()
# convert into a pandas dataframe
active_distance = pd.DataFrame(activity_level_avg_dist)
active_distance.reset_index(inplace=True)
active_distance = active_distance.rename(columns={'index':'ActiveDistanceLevel', 0:'AverageActiveDistance'})
active_distance.head()
| | ActiveDistanceLevel | AverageActiveDistance |
|---|---|---|
| 0 | SedentaryActiveDistance | 0.001601 |
| 1 | LightActiveDistance | 3.349258 |
| 2 | ModeratelyActiveDistance | 0.570880 |
| 3 | VeryActiveDistance | 1.504316 |
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.set_context("notebook")
ax = sns.barplot(x="ActiveDistanceLevel", y="AverageActiveDistance", data=active_distance, ci=None, palette="dark")
ax.set(xlabel="",ylabel="Average Distance")
plt.title("Average Distance of Activity Levels",fontsize=20)
plt.xticks(rotation=45)
ax = plt.gca()
for p in ax.patches:
ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%f' % float(p.get_height()),
fontsize=14, color='black', ha='center', va='bottom')
This bar chart depicts the average distance users clocked at each activity level:
The highest distance, 3.35km, is clocked at the lightly active level. This further reinforces our assumption in Section 5.2 that users are likely wearing their watches for daily lifestyle activities (e.g. walking, doing chores, gardening). The second-highest distance, 1.5km, is clocked at the very active level. Sedentary active clocked the lowest, an almost negligible distance, which makes sense as users are largely inactive and not moving at that level.
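To put these averages in proportion, each level's share of the total average distance can be computed directly from the values in the table above. A minimal sketch:

```python
import pandas as pd

# Average distances per activity level (km), from the table above
avg_dist = pd.Series({
    "SedentaryActiveDistance": 0.001601,
    "LightActiveDistance": 3.349258,
    "ModeratelyActiveDistance": 0.570880,
    "VeryActiveDistance": 1.504316,
})

# Each level's percentage share of the total average distance
share = (avg_dist / avg_dist.sum() * 100).round(1)
print(share)
```

Lightly active distance accounts for roughly 62% of all distance covered, with very active distance a distant second at about 28%.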
daily_activity_sleep['AwakeTimeInbed'] = daily_activity_sleep['TotalTimeInBed'] - daily_activity_sleep['TotalMinutesAsleep']
sleep = daily_activity_sleep.groupby("DayOfWeek")[['TotalTimeInBed','TotalMinutesAsleep',
'AwakeTimeInbed']].mean().reindex(sort_days)
sleep_dow = pd.DataFrame(sleep)
sleep_dow.reset_index(inplace=True)
display(sleep_dow)
| | DayOfWeek | TotalTimeInBed | TotalMinutesAsleep | AwakeTimeInbed |
|---|---|---|---|---|
| 0 | Monday | 456.170213 | 418.829787 | 37.340426 |
| 1 | Tuesday | 443.292308 | 404.538462 | 38.753846 |
| 2 | Wednesday | 470.030303 | 434.681818 | 35.348485 |
| 3 | Thursday | 435.800000 | 402.369231 | 33.430769 |
| 4 | Friday | 445.052632 | 405.421053 | 39.631579 |
| 5 | Saturday | 461.275862 | 420.810345 | 40.465517 |
| 6 | Sunday | 503.509091 | 452.745455 | 50.763636 |
sleep_dow.plot(x="DayOfWeek", kind="bar", figsize=(12,6), ylabel="Average of Total Mins")
plt.title("Average Time of Sleep Activity", fontsize=20, fontweight="bold")
plt.xlabel("")
plt.xticks(rotation=45)
plt.ylabel("Average of Total Mins", fontsize=15)
Users have a mean sleep duration of 419.5 minutes (~7hrs), which is consistent across the week and within the healthy range. The highest recorded mean time asleep was on Sundays (~7.5hrs) and the lowest on Thursdays (~6.7hrs).
Comparing this chart with Section 5.6 (Total steps by day of week), we see that the lowest average total steps were also recorded on Sundays. This reinforces our assumption that Sundays are likely a rest day for users.
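One illustrative extension of this table is sleep efficiency (the share of time in bed actually spent asleep), which can be derived from the two columns above. A minimal sketch using the day-of-week means copied from the table (the SleepEfficiency column is our addition, not part of the original dataset):

```python
import pandas as pd

# Day-of-week means (minutes), copied from the table above
sleep_dow = pd.DataFrame({
    "DayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday"],
    "TotalTimeInBed": [456.170213, 443.292308, 470.030303, 435.800000,
                       445.052632, 461.275862, 503.509091],
    "TotalMinutesAsleep": [418.829787, 404.538462, 434.681818, 402.369231,
                           405.421053, 420.810345, 452.745455],
})

# Sleep efficiency: percentage of time in bed spent asleep
sleep_dow["SleepEfficiency"] = (sleep_dow["TotalMinutesAsleep"]
                                / sleep_dow["TotalTimeInBed"] * 100).round(1)
print(sleep_dow[["DayOfWeek", "SleepEfficiency"]])
```

Sunday has the most sleep in absolute minutes but the lowest efficiency (~90%), matching the larger awake-time-in-bed figure in the table.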
# Categorizing users based on their amount of sleep
def sleep_grp_if(TotalMinutesAsleep):
if (TotalMinutesAsleep > 420) :
return 'Adequate Sleep'
else:
return 'Inadequate Sleep'
sleep_amt = sleep_day.loc[:,("Id", "Date", "TotalMinutesAsleep")]
sleep_amt['sleep_type'] = sleep_amt['TotalMinutesAsleep'].apply(sleep_grp_if)
sleep_amt.head()
| | Id | Date | TotalMinutesAsleep | sleep_type |
|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 327 | Inadequate Sleep |
| 1 | 1503960366 | 2016-04-13 | 384 | Inadequate Sleep |
| 2 | 1503960366 | 2016-04-15 | 412 | Inadequate Sleep |
| 3 | 1503960366 | 2016-04-16 | 340 | Inadequate Sleep |
| 4 | 1503960366 | 2016-04-17 | 700 | Adequate Sleep |
# Identifying the number of users for each sleep category
sleep_proportion = sleep_amt['sleep_type'].value_counts()
sleep_proportion = pd.DataFrame(sleep_proportion)
sleep_proportion.reset_index(inplace=True)
sleep_proportion = sleep_proportion.rename(columns = {'index':'sleep_type', 'sleep_type':'sleep_type_count'})
display(sleep_proportion)
| | sleep_type | sleep_type_count |
|---|---|---|
| 0 | Adequate Sleep | 230 |
| 1 | Inadequate Sleep | 183 |
#Plotting the piechart
fig = px.pie(sleep_proportion, values='sleep_type_count', names='sleep_type', title = "Proportion of users by sleep adequacy")
fig.update_traces(textposition="inside", labels=["Adequate Sleep","Inadequate Sleep"],textfont_size=20)
The pie chart shows a roughly balanced split between users with adequate and inadequate sleep. However, there could be initiatives to encourage more users to get at least 7 hours of sleep.
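The exact percentages behind the pie chart follow directly from the counts above. A quick sketch:

```python
# Counts from the sleep_proportion table above
adequate, inadequate = 230, 183
total = adequate + inadequate

pct_inadequate = round(inadequate / total * 100, 1)
print(f"Adequate: {adequate / total:.1%}, Inadequate: {pct_inadequate}%")
```

This yields roughly a 56/44 split, with 44.3% of logged nights falling short of 7 hours of sleep.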
# Categorizing users based on sleep hours
def sleep_grp_hrs(TotalMinutesAsleep):
if (TotalMinutesAsleep <= 420) :
return 'Less than 7hrs'
elif (TotalMinutesAsleep <=540):
return '7hrs to 9hrs'
else:
return 'More than 9hrs'
sleep_distribution = sleep_day.loc[:,("Id", "Date", "TotalMinutesAsleep")]
sleep_distribution['sleep_grp_hrs'] = sleep_distribution['TotalMinutesAsleep'].apply(sleep_grp_hrs)
sleep_distribution.head()
| | Id | Date | TotalMinutesAsleep | sleep_grp_hrs |
|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 327 | Less than 7hrs |
| 1 | 1503960366 | 2016-04-13 | 384 | Less than 7hrs |
| 2 | 1503960366 | 2016-04-15 | 412 | Less than 7hrs |
| 3 | 1503960366 | 2016-04-16 | 340 | Less than 7hrs |
| 4 | 1503960366 | 2016-04-17 | 700 | More than 9hrs |
sleep_proportion_hrs = sleep_distribution['sleep_grp_hrs'].value_counts()
sleep_proportion_hrs = pd.DataFrame(sleep_proportion_hrs)
sleep_proportion_hrs.reset_index(inplace=True)
sleep_proportion_hrs = sleep_proportion_hrs.rename(columns = {'index':'sleep_grp_hrs', 'sleep_grp_hrs':'sleep_count'})
display(sleep_proportion_hrs)
| | sleep_grp_hrs | sleep_count |
|---|---|---|
| 0 | 7hrs to 9hrs | 191 |
| 1 | Less than 7hrs | 183 |
| 2 | More than 9hrs | 39 |
X1 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == 'Less than 7hrs','TotalMinutesAsleep']
X2 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == '7hrs to 9hrs','TotalMinutesAsleep']
X3 = sleep_distribution.loc[sleep_distribution.sleep_grp_hrs == 'More than 9hrs','TotalMinutesAsleep']
plt.figure(figsize=(14,8))
plt.hist(X1, color='r', label='Less than 7hrs', edgecolor='k',alpha=0.7, bins=20)
plt.hist(X2, color='g', label='7hrs to 9hrs', edgecolor='k',alpha=0.7, bins=20)
plt.hist(X3, color='b', label='More than 9hrs', edgecolor='k',alpha=0.7, bins=20)
plt.title('Distribution of Users Sleep Hours', fontsize=20, fontweight="bold")
plt.xlabel('Sleep Time (Minutes)', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.legend()
Here, we break down the distribution of sleep durations, showing that the majority of users get approximately 340-540 minutes (5.6hrs-9hrs) of sleep.
# Creating a dataframe containing correlation coefficients of variables in daily_activity_sleep
total_corr = daily_activity_sleep[["TotalSteps", "TotalDistance", "LoggedActivitiesDistance","VeryActiveDistance", "ModeratelyActiveDistance", "LightActiveDistance", "SedentaryActiveDistance", "VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "SedentaryMinutes", "TotalMinutesAsleep", "TotalTimeInBed", "Calories"]].corr()
# plotting the heatmap
fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(total_corr, annot=True, fmt = '.2f', cmap="viridis")
plt.title("Correlation Heatmap of daily_activity dataset", fontsize = 25)
Finally, we generate a correlation heatmap to give an overview of the correlation levels across the variables in the daily_activity_sleep dataframe. Some of the relevant variable pairs identified with strong correlation (r > 0.6) are:
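Rather than reading the heatmap by eye, the strongly correlated pairs can also be extracted from the correlation matrix programmatically. A minimal sketch on a small synthetic dataframe (the column relationships below are fabricated for illustration and do not reproduce the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
steps = rng.normal(7500, 2500, 200)

# Synthetic columns with built-in linear relationships (illustration only)
df = pd.DataFrame({
    "TotalSteps": steps,
    "TotalDistance": steps / 1300 + rng.normal(0, 0.5, 200),
    "Calories": 1660 + 0.084 * steps + rng.normal(0, 200, 200),
    "SedentaryMinutes": rng.normal(990, 120, 200),  # unrelated noise
})

corr = df.corr()
# Keep each pair once (upper triangle, excluding the diagonal)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
strong = pairs[pairs.abs() > 0.6].sort_values(ascending=False)
print(strong)
```

The same filter applied to total_corr would list the strong pairs from the real heatmap without manual inspection.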
Users spend most of their daily activity time at the lightly active level, with an average of ~3.2hrs and the highest distance clocked (3.35km).
Although users spent 21 minutes on average in the very active category, 81% of their day is spent being sedentary, which highlights a concern.
The average user burns 2307 calories and clocks 7652 steps per day.
The highest calorie burn is 2365 calories on Saturdays and the lowest is 2204 calories on Thursdays.
The average user burns the most calories between 5pm-7pm, gradually reducing from 7pm onwards.
The highest average number of steps (8202 steps) is clocked on Saturdays and the lowest (6933 steps) on Sundays.
The average user begins their day at 5am and clocks the highest number of steps between 5-7pm, gradually reducing their activity from 7pm onwards.
There is a moderately strong positive linear relationship between total steps clocked and total calories burned (r = 0.59).
Users have a consistent sleep schedule with a mean sleep duration of 419.5 minutes (~7hrs) across the week. The highest recorded mean time asleep was on Sundays (~7.5hrs) and the lowest on Thursdays (~6.7hrs).
The majority of users get approximately 340-540 minutes (5.6hrs-9hrs) of sleep.
44.3% of users have inadequate sleep (<7 hours).
At least 5 relevant pairs of variables were found to have a strong correlation (r > 0.6).
Some of these market segments could consist of:
The information above allows the Bellabeat app to provide a suitable program for each user. This will play a crucial role in identifying how the products are received across different segments, which will ultimately influence how Bellabeat drives its marketing campaigns.
Bellabeat could allow users to configure app and device settings that serve as reminders and motivation to achieve their desired lifestyle goals. For example, a notification via the app could remind users to get active or to keep a consistent bedtime routine. Push notifications could also offer positive reinforcement by showing users the progress they have achieved throughout the day or week, based on the data collected.
Bellabeat could incorporate breathing or mindfulness functions in the app to help users wind down and reduce anxiety and stress levels before bedtime. These functions could be linked with notifications reminding users to practice breathing and mindfulness activities before their scheduled bedtime, along with a follow-along video demonstrating breathing and relaxation techniques that improve sleep quality.